(54) Feature extraction and pattern recognition



(57) It is intended to increase the recognition rate in speech recognition and image recognition. An observation vector as input data, which represents a certain point in the observation vector space, is mapped to a distribution having a spread in the feature vector space, and a feature distribution parameter representing the distribution is determined. Pattern recognition of the input data is performed based on the feature distribution parameter.
Description

[0001] The present invention relates to a feature extraction apparatus and method and a pattern recognition apparatus and method. In particular, the invention relates to a feature extraction apparatus and method and a pattern recognition apparatus and method which are suitable for use in a case where speech recognition is performed in a noisy environment.

[0002] Fig. 1 shows an example configuration of a conventional pattern recognition apparatus. 
[0003] An observation vector as a pattern recognition object is input to a feature extraction section 101. The feature extraction section 101 determines, based on the observation vector, a feature vector that represents its feature quantity. The feature vector thus determined is supplied to a discrimination section 102. Based on the feature vector supplied from the feature extraction section 101, the discrimination section 102 judges which of a predetermined number of classes the input observation vector belongs to.

[0004] For example, where the pattern recognition apparatus of Fig. 1 is a speech recognition apparatus, speech data of each time unit (hereinafter referred to as a frame where appropriate) is input to the feature extraction section 101 as an observation vector. The feature extraction section 101 acoustically analyzes the speech data as the observation vector, and thereby extracts a feature vector as a feature quantity of speech such as a power spectrum, cepstrum coefficients, or linear prediction coefficients. The feature vector is supplied to the discrimination section 102. The discrimination section 102 classifies the feature vector as one of a predetermined number of classes. A classification result is output as a recognition result of the speech data (observation vector).

[0005] Among known methods for judging, in the discrimination section 102, which one of a predetermined number of classes a feature vector belongs to are a method using a Mahalanobis discriminant function, a mixed normal distribution function, or a polynomial function, a method using an HMM method, and a method using a neural network.
[0006] For example, the details of the above speech recognition techniques are disclosed in "Fundamentals of Speech Recognition (I) and (II)," co-authored by L. Rabiner and B-H Juang, translation supervised by Furui, NTT Advanced Technology Corp., 1995. As for general pattern recognition, detailed descriptions are made in, for example, R. Duda and P. Hart, "Pattern Classification and Scene Analysis," John Wiley & Sons, 1973.
[0007] Incidentally, when pattern recognition is performed, an observation vector (input pattern) as a pattern recognition object generally includes noise. For example, a voice as an observation vector that is input when speech recognition is performed includes noise of an environment of a user's speech (e.g., voices of other persons or noise of a car). To give another example, an image as an observation vector that is input when image recognition is performed includes noise of a photographing environment of the image (e.g., noise relating to weather conditions such as mist or rain, or noise due to lens aberrations of a camera for photographing the image).

[0008] Spectral subtraction is known as one of the feature quantity (feature vector) extraction methods that are used in a case of recognizing voices in a noisy environment.
[0009] In the spectral subtraction, an input before occurrence of a voice (i.e., an input before a speech section) is employed as noise and an average spectrum of the noise is calculated. Upon subsequent input of a voice, the noise average spectrum is subtracted from the voice and a feature vector is calculated by using a remaining component as a true voice component.
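The spectral subtraction procedure just described can be illustrated with a minimal Python sketch. The function name `spectral_subtraction`, the list-of-lists data layout, and the clamping of negative results at `floor` are assumptions made for illustration; they are not taken from the cited references.

```python
def spectral_subtraction(noise_frames, speech_frames, floor=0.0):
    """Subtract the average noise spectrum from each speech-frame spectrum."""
    dim = len(noise_frames[0])
    # Average spectrum of the frames observed before the speech section.
    noise_mean = [sum(f[i] for f in noise_frames) / len(noise_frames)
                  for i in range(dim)]
    # The remainder is treated as the true voice component. Clamping at
    # `floor` is an added assumption that keeps power values non-negative.
    return [[max(y[i] - noise_mean[i], floor) for i in range(dim)]
            for y in speech_frames]


noise = [[1.0, 2.0], [3.0, 2.0]]   # spectra before the speech section
speech = [[5.0, 1.0]]              # spectrum of one speech frame
cleaned = spectral_subtraction(noise, speech)
```

Only the average of the noise enters the result, so any spread of the noise around that average is discarded.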

[0010] For example, the details of the spectral subtraction are disclosed in S.F. Boll, "Suppression of Acoustic Noise in Speech Using Spectral Subtraction," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-27, No. 2, 1979; and P. Lockwood and J. Boudy, "Experiments with a Nonlinear Spectral Subtractor, Hidden Markov Models and the Projection, for Robust Speech Recognition in Cars," Speech Communication, Vol. 11, 1992.
[0011] Incidentally, it can be considered that the feature extraction section 101 of the pattern recognition apparatus of Fig. 1 executes a process in which an observation vector a representing a certain point in the observation vector space is mapped to (converted into) a feature vector y representing a corresponding point in the feature vector space as shown in Fig. 2.

[0012] Therefore, the feature vector y represents a certain point (corresponding to the observation vector a) in the feature vector space. In Fig. 2, each of the observation vector space and the feature vector space is drawn as a three-dimensional space.

[0013] In the spectral subtraction, an average noise component spectrum is subtracted from the observation vector a and then the feature vector y is calculated. However, since the feature vector y represents one point in the feature vector space as described above, the feature vector y reflects the average characteristics of the noise but not characteristics that represent its irregularity, such as variance.

[0014] Therefore, the feature vector y does not sufficiently reflect the features of the observation vectors, and hence it is difficult to obtain a high recognition rate with such a feature vector y.

[0015] According to a first aspect of the invention, there is provided a feature extraction apparatus which extracts a feature quantity of input data, comprising calculating means for calculating a feature distribution parameter representing a distribution that is obtained when mapping of the input data is made to a space of a feature quantity of the input data.
[0016] According to a second aspect of the invention, there is provided a feature extraction method for extracting a feature quantity of input data, comprising the step of calculating a feature distribution parameter representing a distribution that is obtained when mapping of the input data is made to a space of a feature quantity of the input data.
[0017] According to a third aspect of the invention, there is provided a pattern recognition apparatus which recognizes a pattern of input data by classifying it as one of a predetermined number of classes, comprising calculating means for calculating a feature distribution parameter representing a distribution that is obtained when mapping of the input data is made to a space of a feature quantity of the input data; and classifying means for classifying the feature distribution parameter as one of the predetermined number of classes.

[0018] According to a fourth aspect of the invention, there is provided a pattern recognition method for recognizing a pattern of input data by classifying it as one of a predetermined number of classes, comprising the steps of calculating a feature distribution parameter representing a distribution that is obtained when mapping of the input data is made to a space of a feature quantity of the input data; and classifying the feature distribution parameter as one of the predetermined number of classes.

[0019] According to a fifth aspect of the invention, there is provided a pattern recognition apparatus which recognizes a pattern of input data by classifying it as one of a predetermined number of classes, comprising framing means for extracting parts of the input data at predetermined intervals, and outputting each extracted data as 1-frame data; feature extracting means receiving the 1-frame data of each extracted data, for outputting a feature distribution parameter representing a distribution that is obtained when mapping of the 1-frame data is made to a space of a feature quantity of the 1-frame data; and classifying means for classifying a series of feature distribution parameters as one of the predetermined number of classes.

[0020] According to a sixth aspect of the invention, there is provided a pattern recognition method for recognizing a pattern of input data by classifying it as one of a predetermined number of classes, comprising a framing step of extracting parts of the input data at predetermined intervals, and outputting each extracted data as 1-frame data; a feature extracting step of receiving the 1-frame data of each extracted data, and outputting a feature distribution parameter representing a distribution that is obtained when mapping of the 1-frame data is made to a space of a feature quantity of the 1-frame data; and a classifying step of classifying a series of feature distribution parameters as one of the predetermined number of classes.

[0021] In the feature extraction apparatus according to the first aspect of the invention, the calculating means calculates a feature distribution parameter representing a distribution that is obtained when mapping of the input data is made to a space of a feature quantity of the input data.

[0022] In the feature extraction method according to the second aspect of the invention, a feature distribution parameter representing a distribution that is obtained when mapping of the input data is made to a space of a feature quantity of the input data is calculated.

[0023] In the pattern recognition apparatus according to the third aspect of the invention, the calculating means calculates a feature distribution parameter representing a distribution that is obtained when mapping of the input data is made to a space of a feature quantity of the input data, and the classifying means classifies the feature distribution parameter as one of the predetermined number of classes.

[0024] In the pattern recognition method according to the fourth aspect of the invention, a feature distribution parameter representing a distribution that is obtained when mapping of the input data is made to a space of a feature quantity of the input data is calculated, and the feature distribution parameter is classified as one of the predetermined number of classes.

[0025] In a pattern recognition apparatus according to the fifth aspect of the invention which recognizes a pattern of input data by classifying it as one of a predetermined number of classes, parts of the input data are extracted at predetermined intervals, and each extracted data is output as 1-frame data. A feature distribution parameter representing a distribution that is obtained when mapping of the 1-frame data of each extracted data is made to a space of a feature quantity of the 1-frame data is output. Then, a series of feature distribution parameters is classified as one of the predetermined number of classes.

[0026] In a pattern recognition method according to the sixth aspect of the invention for recognizing a pattern of input data by classifying it as one of a predetermined number of classes, parts of the input data are extracted at predetermined intervals, and each extracted data is output as 1-frame data. A feature distribution parameter representing a distribution that is obtained when mapping of the 1-frame data of each extracted data is made to a space of a feature quantity of the 1-frame data is output. Then, a series of feature distribution parameters is classified as one of the predetermined number of classes.

[0027] To allow better understanding, the following description of embodiments of the present invention is given by way of non-limitative example, with reference to the drawings, in which:

Fig. 1 is a block diagram showing an example configuration of a conventional pattern recognition apparatus; 
Fig. 2 illustrates a process of a feature extraction section 101 shown in Fig. 1; 
Fig. 3 is a block diagram showing an example configuration of a speech recognition apparatus according to an embodiment of the present invention;
Fig. 4 illustrates a process of a framing section 1 shown in Fig. 3;
Fig. 5 illustrates a process of a feature extraction section 2 shown in Fig. 3;
Fig. 6 is a block diagram showing an example configuration of the feature extraction section 2 shown in Fig. 3;
Figs. 7A and 7B show probability density functions of a noise power spectrum and a true voice power spectrum;
Fig. 8 is a block diagram showing an example configuration of a discrimination section 3 shown in Fig. 3;
Fig. 9 shows an HMM; and
Fig. 10 is a block diagram showing another example configuration of the feature extraction section 2 shown in Fig. 3.

[0028] Fig. 3 shows an example configuration of a speech recognition apparatus according to an embodiment of the present invention.

[0029] Digital speech data as a recognition object is input to a framing section 1. For example, as shown in Fig. 4, the framing section 1 extracts parts of received speech data at predetermined time intervals (e.g., 10 ms; this operation is called framing) and outputs each extracted speech data as 1-frame data. Each 1-frame speech data that is output from the framing section 1 is supplied to a feature extraction section 2 in the form of an observation vector a having respective time-series speech data constituting the frame as components.
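As a rough illustration of framing, the following Python sketch cuts a sample sequence into observation vectors. The names `frame_signal`, `frame_len`, and `hop` are hypothetical; the text states only the extraction interval (e.g., 10 ms), so the frame length and the handling of a trailing partial frame are assumptions.

```python
def frame_signal(samples, frame_len, hop):
    """Split samples into frames of frame_len samples taken every hop samples."""
    frames = []
    # A trailing partial frame is dropped (an assumption of this sketch).
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames


samples = list(range(10))                      # stand-in for digital speech data
frames = frame_signal(samples, frame_len=4, hop=2)
# Each element of `frames` plays the role of one observation vector a(t).
```

With a 16 kHz sampling rate (again an assumption), a 10 ms interval would correspond to `hop=160`.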

[0030] In the following, an observation vector as speech data of a t-th frame is represented by a(t), where appropriate.
[0031] The feature extraction section 2 (calculating means) acoustically analyzes the speech data as the observation vector a that is supplied from the framing section 1 and thereby extracts a feature quantity from the speech data. For example, the feature extraction section 2 determines a power spectrum of the speech data as the observation vector a by Fourier-transforming it, and calculates a feature vector y having respective frequency components of the power spectrum as components. The method of calculating a power spectrum is not limited to Fourier transform; a power spectrum can be determined by other methods such as a filter bank method.

[0032] Further, the feature extraction section 2 calculates, based on the above-calculated feature vector y, a parameter (hereinafter referred to as a feature distribution parameter) Z that represents a distribution, in the space of a feature quantity (i.e., feature vector space), obtained when a true voice included in the speech data as the observation vector a is mapped to points in the feature vector space, and supplies the parameter Z to a discrimination section 3.
[0033] That is, as shown in Fig. 5, the feature extraction section 2 calculates and outputs, as a feature distribution parameter, a parameter that represents a distribution having a spread in the feature vector space obtained by mapping of an observation vector a representing a certain point in the observation vector space to the feature vector space.
[0034] Although in Fig. 5 each of the observation vector space and the feature vector space is drawn as a three-dimensional space, the respective numbers of dimensions of the observation vector space and the feature vector space are not limited to three and even need not be the same.

[0035] The discrimination section 3 (classifying means) classifies each of feature distribution parameters (a series of parameters) that are supplied from the feature extraction section 2 as one of a predetermined number of classes, and outputs a classification result as a recognition result of the input voice. For example, the discrimination section 3 stores discriminant functions to be used for judging which of classes corresponding to a predetermined number K of words a discrimination object belongs to, and calculates values of the discriminant functions of the respective classes by using, as an argument, the feature distribution parameter that is supplied from the feature extraction section 2. A class (in this case, a word) having the largest function value is output as a recognition result of the voice as the observation vector a.

[0036] Next, the operation of the above apparatus will be described. 

[0037] The framing section 1 frames input digital speech data as a recognition object. Observation vectors a of speech data of respective frames are sequentially supplied to the feature extraction section 2. The feature extraction section 2 determines a feature vector y by acoustically analyzing the speech data as the observation vector a that is supplied from the framing section 1. Further, based on the feature vector y thus determined, the feature extraction section 2 calculates a feature distribution parameter Z that represents a distribution in the feature vector space, and supplies it to the discrimination section 3. The discrimination section 3 calculates, by using the feature distribution parameter supplied from the feature extraction section 2, values of the discriminant functions of the respective classes corresponding to the predetermined number K of words, and outputs a class having the largest function value as a recognition result of the voice.

[0038] Since speech data as an observation vector a is converted into a feature distribution parameter Z that represents a distribution in the feature vector space (space of a feature quantity of speech data) as described above, the feature distribution parameter Z reflects distribution characteristics of noise included in the speech data. Further, since the voice is recognized based on such a feature distribution parameter Z, the recognition rate can greatly be increased.
[0039] Fig. 6 shows an example configuration of the feature extraction section 2 shown in Fig. 3. 
[0040] An observation vector a is supplied to a power spectrum analyzer 12. The power spectrum analyzer 12 Fourier-transforms the observation vector a according to, for instance, an FFT (fast Fourier transform) algorithm, and thereby determines (extracts), as a feature vector, a power spectrum that is a feature quantity of the voice. It is assumed here that an observation vector a as speech data of one frame is converted into a feature vector that consists of D components (i.e., a D-dimensional feature vector).
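The conversion of one frame into a D-dimensional power spectrum can be sketched as follows. For brevity this sketch uses a naive discrete Fourier transform from the Python standard library rather than an FFT, and it keeps the bins 0 to N/2 of a real-valued frame, so D = N/2 + 1 here; the function name and that choice of D are assumptions made for illustration.

```python
import cmath


def power_spectrum(frame):
    """Return |DFT|^2 for bins 0..N/2 of a real-valued frame (naive DFT)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        # k-th Fourier coefficient of the frame.
        coeff = sum(frame[m] * cmath.exp(-2j * cmath.pi * k * m / n)
                    for m in range(n))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


# One period of a cosine sampled at four points: all power falls in bin 1.
y = power_spectrum([1.0, 0.0, -1.0, 0.0])
```

An FFT routine would give the same values with far fewer operations for realistic frame lengths.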

[0041] Now, a feature vector obtained from an observation vector a(t) of a t-th frame is represented by y(t). Further, a true voice component spectrum and a noise component spectrum of the feature vector y(t) are represented by x(t) and u(t), respectively. In this case, the component spectrum x(t) of the true voice is given by

$$x(t) = y(t) - u(t) \tag{1}$$

where it is assumed that noise has irregular characteristics and that the speech data as the observation vector a(t) is the sum of the true voice component and the noise.

[0042] Since the noise u(t) has irregular characteristics, u(t) is a random variable and hence x(t), which is given by Equation (1), is also a random variable. Therefore, for example, if the noise power spectrum has a probability density function shown in Fig. 7A, the probability density function of the power spectrum of the true voice is given as shown in Fig. 7B according to Equation (1). The probability that the power spectrum of the true voice has a certain value is obtained by multiplying, by a normalization factor that makes the probability distribution of the true voice have an area of unity, a probability that the noise power spectrum has a value obtained by subtracting the above value of the power spectrum of the true voice from the power spectrum of the input voice (input signal). Figs. 7A and 7B are drawn with an assumption that the number of components of each of u(t), x(t), and y(t) is one (D = 1).

[0043] Returning to Fig. 6, the feature vector y(t) obtained by the power spectrum analyzer 12 is supplied to a switch 13. The switch 13 selects one of terminals 13a and 13b under the control of a speech section detection section 11.
[0044] The speech section detection section 11 detects a speech section (i.e., a period during which a user is speaking). For example, the details of a method of detecting a speech section are disclosed in J.C. Junqua, B. Mark, and B. Reaves, "A Robust Algorithm for Word Boundary Detection in the Presence of Noise," IEEE Transactions on Speech and Audio Processing, Vol. 2, No. 3, 1994.

[0045] A speech section can be recognized in other ways, for example, by providing a proper button in the speech 
recognition apparatus and having a user manipulate the button while he is speaking. 

[0046] The speech section detection section 11 controls the switch 13 so that it selects the terminal 13b in speech sections and the terminal 13a in the other sections (hereinafter referred to as non-speech sections where appropriate).
[0047] Therefore, in a non-speech section, the switch 13 selects the terminal 13a, whereby an output of the power spectrum analyzer 12 is supplied to a noise characteristics calculator 14 via the switch 13. The noise characteristics calculator 14 calculates noise characteristics in a speech section based on the output of the power spectrum analyzer 12 in the non-speech section.

[0048] In this example, the noise characteristics calculator 14 determines average values (an average vector) and variance (a variance matrix) of noise with assumptions that a noise power spectrum u(t) in a certain speech section has the same distribution as that in the non-speech section immediately preceding that speech section and that the distribution is a normal distribution.

[0049] Specifically, assuming that the first frame of the speech section is a No. 1 frame (t = 1), an average vector μ' and a variance matrix Σ' of outputs y(-200) to y(-101) of the power spectrum analyzer 12 of 100 frames (from a frame preceding the speech section by 200 frames to a frame preceding the speech section by 101 frames) are determined as noise characteristics in the speech section.

[0050] The average vector μ' and the variance matrix Σ' can be determined according to

$$\mu'(i) = \frac{1}{100} \sum_{t=-200}^{-101} y(t)(i), \qquad \Sigma'(i,j) = \frac{1}{100} \sum_{t=-200}^{-101} \bigl(y(t)(i) - \mu'(i)\bigr)\bigl(y(t)(j) - \mu'(j)\bigr) \tag{2}$$

where μ'(i) represents an ith component of the average vector μ' (i = 1, 2, ..., D), y(t)(i) represents an ith component of a feature vector of a t-th frame, and Σ'(i, j) represents an ith-row, jth-column component of the variance matrix Σ' (j = 1, 2, ..., D).

[0051] Here, to reduce the amount of calculation, it is assumed that for noise the components of the feature vector y have no mutual correlation. In this case, the components other than the diagonal components of the variance matrix Σ' are zero as expressed by

$$\Sigma'(i,j) = 0, \quad i \neq j \tag{3}$$
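A minimal Python sketch of the noise characteristics calculation under the no-correlation assumption is given below. Only the per-component average and the diagonal of the variance matrix are computed. The function name and the division by the number of frames (rather than that number minus one) are assumptions of this sketch.

```python
def noise_statistics(noise_frames):
    """Per-component mean and variance (diagonal of the variance matrix)."""
    count = len(noise_frames)
    dim = len(noise_frames[0])
    # Average vector: one mean per frequency component.
    mean = [sum(y[i] for y in noise_frames) / count for i in range(dim)]
    # Diagonal variance terms; off-diagonal terms are taken to be zero.
    var = [sum((y[i] - mean[i]) ** 2 for y in noise_frames) / count
           for i in range(dim)]
    return mean, var


# Two non-speech frames with D = 2 components each (toy values).
mean, var = noise_statistics([[1.0, 4.0], [3.0, 4.0]])
```

In the embodiment the input would be the 100 power spectra from the non-speech section preceding the speech section.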

[0052] The noise characteristics calculator 14 determines the average vector μ' and the variance matrix Σ' as noise characteristics in the above-described manner and supplies those to a feature distribution parameter calculator 15.
[0053] On the other hand, the switch 13 selects the terminal 13b in the speech section, whereby an output of the power spectrum analyzer 12, that is, a feature vector y as speech data including a true voice and noise, is supplied to a feature distribution parameter calculator 15 via the switch 13. Based on the feature vector y that is supplied from the power spectrum analyzer 12 and the noise characteristics that are supplied from the noise characteristics calculator 14, the feature distribution parameter calculator 15 calculates a feature distribution parameter that represents a distribution of the power spectrum of the true voice (distribution of estimated values).

[0054] That is, with an assumption that the power spectrum of the true voice has a normal distribution, the feature distribution parameter calculator 15 calculates, as a feature distribution parameter, an average vector ξ and a variance matrix Ψ of the distribution according to the following formulae:

$$\begin{aligned}
\xi(t)(i) &= E[x(t)(i)] \\
&= E[y(t)(i) - u(t)(i)] \\
&= \int_0^{y(t)(i)} \bigl(y(t)(i) - u(t)(i)\bigr)\, \frac{P(u(t)(i))}{\int_0^{y(t)(i)} P(u(t)(i))\, du(t)(i)}\, du(t)(i) \\
&= y(t)(i) - \frac{\int_0^{y(t)(i)} u(t)(i)\, P(u(t)(i))\, du(t)(i)}{\int_0^{y(t)(i)} P(u(t)(i))\, du(t)(i)}
\end{aligned} \tag{4}$$

$$\Psi(t)(i,j) = \begin{cases} V[x(t)(i)] = E[(x(t)(i))^2] - (E[x(t)(i)])^2 = E[(x(t)(i))^2] - (\xi(t)(i))^2, & i = j \\ 0, & i \neq j \end{cases} \tag{5}$$

$$\begin{aligned}
E[(x(t)(i))^2] &= E[(y(t)(i) - u(t)(i))^2] \\
&= \frac{\int_0^{y(t)(i)} \bigl(y(t)(i) - u(t)(i)\bigr)^2\, P(u(t)(i))\, du(t)(i)}{\int_0^{y(t)(i)} P(u(t)(i))\, du(t)(i)}
\end{aligned} \tag{6}$$

$$P(u(t)(i)) = \frac{1}{\sqrt{2\pi\, \Sigma'(i,i)}} \exp\left(-\frac{(u(t)(i) - \mu'(i))^2}{2\, \Sigma'(i,i)}\right) \tag{7}$$

[0055] In the above formulae, ξ(t)(i) represents an ith component of an average vector ξ(t) of a t-th frame, E[] means an average value of a variable in brackets "[]," and x(t)(i) represents an ith component of a power spectrum x(t) of the true voice of the t-th frame. Further, u(t)(i) represents an ith component of a noise power spectrum of the t-th frame, and P(u(t)(i)) represents a probability that the ith component of the noise power spectrum of the t-th frame is u(t)(i). In this example, since the noise distribution is assumed to be a normal distribution, P(u(t)(i)) is given by Equation (7).
[0056] Further, Ψ(t)(i, j) represents an ith-row, jth-column component of a variance matrix Ψ(t) of the t-th frame, and V[] means variance of a variable in brackets "[]."

[0057] In the above manner, the feature distribution parameter calculator 15 determines, for each frame, an average vector ξ and a variance matrix Ψ as a feature distribution parameter representing a distribution of the true voice in the feature vector space (i.e., a normal distribution as assumed to be a distribution of the true voice in the feature vector space).
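The per-component calculation of the average and variance of the true voice can be approximated numerically, as in the Python sketch below. It treats the noise u as normally distributed with the measured mean and variance, restricted to the range 0 to y, and takes moments of x = y - u. The midpoint-rule integration, the function name, and the handling of y <= 0 are assumptions of this sketch, not details given in the text.

```python
import math


def feature_distribution(y, mu, var, steps=2000):
    """Mean and variance of x = y - u, with u ~ N(mu, var) restricted to [0, y]."""
    if y <= 0.0:
        return 0.0, 0.0  # assumption: empty integration range
    du = y / steps
    norm = m1 = m2 = 0.0
    for n in range(steps):
        u = (n + 0.5) * du                           # midpoint rule
        p = math.exp(-(u - mu) ** 2 / (2.0 * var))   # unnormalised density
        norm += p
        m1 += (y - u) * p
        m2 += (y - u) ** 2 * p
    # Constant factors of the density and du cancel in these ratios.
    xi = m1 / norm
    psi = m2 / norm - xi ** 2
    return xi, psi


# Observed power 10 with noise mean 5 and variance 1 (toy values).
xi, psi = feature_distribution(10.0, 5.0, 1.0)
```

When y lies very far from the noise mean, the unnormalised density can underflow to zero in floating point; a production version would need to guard against that case.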

[0058] Then, when the speech section has finished, the switch 13 selects the terminal 13a and the feature distribution parameter calculator 15 outputs, to the discrimination section 3, the feature distribution parameter that has been determined for each frame in the speech section. That is, assuming that the speech section consists of T frames and that a feature distribution parameter determined for each of the T frames is expressed as z(t) = {ξ(t), Ψ(t)} where t = 1, 2, ..., T, the feature distribution parameter calculator 15 supplies a feature distribution parameter (a series of parameters) Z = {z(1), z(2), ..., z(T)} to the discrimination section 3.

[0059] The feature extraction section 2 thereafter repeats similar processes. 

[0060] Fig. 8 shows an example configuration of the discrimination section 3 shown in Fig. 3. 

[0061] The feature distribution parameter Z that is supplied from the feature extraction section 2 (feature distribution parameter calculator 15) is supplied to K discriminant function calculation sections 21_1 to 21_K. The discriminant function calculation section 21_k stores a discriminant function g_k(Z) for discrimination of a word corresponding to a kth class of the K classes (k = 1, 2, ..., K), and the discriminant function g_k(Z) is calculated by using, as an argument, the feature distribution parameter Z that is supplied from the feature extraction section 2.

[0062] The discrimination section 3 determines a word as a class according to an HMM (hidden Markov model) 
method, for example. 

[0063] In this embodiment, for example, an HMM shown in Fig. 9 is used. In this HMM, there are H states q_1 to q_H and only a self-transition and a transition to the right adjacent state are permitted. The initial state is the leftmost state q_1 and the final state is the rightmost state q_H, and a state transition from the final state q_H is prohibited. A model in which no transition occurs to states on the left of the current state is called a left-to-right model. A left-to-right model is generally employed in speech recognition.

[0064] Now, a model for discrimination of a kth class of the HMM is called a kth class model. For example, the kth class model is defined by a probability (initial state probability) π_k(q_h) that the initial state is a state q_h, a probability (transition probability) a_k(q_i, q_j) that a state q_i is established at a certain time point (frame) t and a state transition to a state q_j occurs at the next time point t+1, and a probability (output probability) b_k(q_i)(O) that a state q_i outputs a feature vector O when a state transition occurs from the state q_i (h = 1, 2, ..., H).

[0065] When a feature vector series O_1, O_2, ... is supplied, the class of a model having, for example, a highest probability (observation probability) that such a feature vector series is observed is selected as a recognition result of the feature vector series.

[0066] In this example, the observation probability is determined by using the discriminant function g_k(Z). That is, the discriminant function g_k(Z) is given by the following equation as a function for determining a probability that the feature distribution parameter (series) Z = {z_1, z_2, ..., z_T} is observed in an optimum state series (i.e., an optimum manner of state transitions) for the feature distribution parameter (series) Z = {z_1, z_2, ..., z_T}.

$$g_k(Z) = \max_{q_1, q_2, \ldots, q_T} \pi_k(q_1) \cdot b'_k(q_1)(z_1) \cdot a_k(q_1, q_2) \cdot b'_k(q_2)(z_2) \cdots a_k(q_{T-1}, q_T) \cdot b'_k(q_T)(z_T) \tag{8}$$



[0067] In the above equation, b'_k(q_i)(z_j) represents an output probability for an output having a distribution z_j. In this embodiment, for example, an output probability b_k(s)(O_t), which is a probability that each feature vector is output at a state transition, is expressed by a normal distribution function with an assumption that components in the feature vector space have no mutual correlation. In this case, when an input has a distribution z_t, an output probability b'_k(s)(z_t) can be determined by the following equation that includes a probability density function P^m_k(s)(x) that is defined by an average vector μ_k(s) and a variance matrix Σ_k(s) and a probability density function P^f(t)(x) that represents a distribution of a feature vector (in this embodiment, a power spectrum) of a t-th frame.

$$b'_k(s)(z_t) = \int P^f(t)(x)\, P^m_k(s)(x)\, dx = \prod_{i=1}^{D} P(s)(i)\bigl(\xi(t)(i), \Psi(t)(i,i)\bigr), \quad k = 1, 2, \ldots, K;\ s = q_1, q_2, \ldots, q_H;\ t = 1, 2, \ldots, T \tag{9}$$



In Equation (9), the integration interval of the integral is the entire D-dimensional feature vector space (in this example, 
the power spectrum space). 
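[0068] In Equation (9), P(s)(i)(ξ(t)(i), Ψ(t)(i, i)) is given by

$$P(s)(i)\bigl(\xi(t)(i), \Psi(t)(i,i)\bigr) = \frac{1}{\sqrt{2\pi\bigl(\Sigma_k(s)(i,i) + \Psi(t)(i,i)\bigr)}} \exp\left(-\frac{\bigl(\mu_k(s)(i) - \xi(t)(i)\bigr)^2}{2\bigl(\Sigma_k(s)(i,i) + \Psi(t)(i,i)\bigr)}\right) \tag{10}$$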



where μ_k(s)(i) represents an ith component of an average vector μ_k(s) and Σ_k(s)(i, i) represents an ith-row, ith-column component of a variance matrix Σ_k(s). The output probability of the kth class model is defined by the above equations.
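The output probability for an input distribution can be sketched in Python as below. The sketch assumes diagonal variance matrices and uses the fact that integrating the product of two normal densities yields a normal density whose variance is the sum of the two variances. The function and argument names are hypothetical.

```python
import math


def output_probability(xi, psi, mu, sigma2):
    """Product over components of N(xi_i; mu_i, sigma2_i + psi_i)."""
    prob = 1.0
    for x, p, m, s in zip(xi, psi, mu, sigma2):
        total_var = s + p  # state variance plus input-distribution variance
        prob *= (math.exp(-(x - m) ** 2 / (2.0 * total_var))
                 / math.sqrt(2.0 * math.pi * total_var))
    return prob


# One-dimensional example: state N(0, 1), input centred on the state mean.
sharp = output_probability([0.0], [0.0], [0.0], [1.0])   # input has no spread
spread = output_probability([0.0], [1.0], [0.0], [1.0])  # input has variance 1
```

A larger input variance flattens the effective density, so an uncertain frame contributes a less decisive score than a frame whose estimate is sharp.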
[0069] As mentioned above, the HMM is defined by the initial state probabilities π_k(q_h), the transition probabilities a_k(q_i, q_j), and the output probabilities b_k(q_i)(O), which are determined in advance by using feature vectors that are calculated based on learning speech data.
[0070] Where the HMM shown in Fig. 9 is used, transitions start from the leftmost state q_1. Therefore, the initial
probability of only the state q_1 is 1 and the initial probabilities of the other states are 0. As seen from Equations (9) and
(10), if terms Ψ(t)(i, i) are 0, the output probability is equal to an output probability in a continuous HMM in which the
variance of feature vectors is not taken into consideration.

[0071] An example of an HMM learning method is the Baum-Welch re-estimation method.

[0072] The discriminant function calculation section 21_k shown in Fig. 6 stores, for the kth class model, the discriminant
function g_k(Z) of Equation (8) that is defined by the initial state probabilities π_k(q_h), the transition probabilities
a_k(q_i, q_j), and the output probabilities b_k(q_i)(O), which have been determined in advance through learning. The discriminant
function calculation section 21_k calculates the discriminant function g_k(Z) by using a feature distribution parameter Z
that is supplied from the feature extraction section 2, and outputs a resulting function value (the above-described
observation probability) g_k(Z) to a decision section 22.
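The maximization over state sequences in Equation (8) is commonly evaluated with a Viterbi-style recursion. The following is a minimal sketch under that assumption, working in the log domain to avoid underflow; the function and argument names are hypothetical, and the per-frame output probabilities b'_k(s)(z_t) are assumed to have been computed beforehand.

```python
import math

def safe_log(p):
    """Natural log that maps probability 0 to -infinity instead of raising."""
    return math.log(p) if p > 0.0 else float("-inf")

def log_discriminant(log_pi, log_a, log_b):
    """Log of Eq. (8): best-path score over all state sequences q_1..q_T.

    log_pi[s]   -- log initial probability of state s
    log_a[r][s] -- log transition probability from state r to state s
    log_b[t][s] -- log output probability b'_k(s)(z_t) for frame t
    """
    n_states = len(log_pi)
    # Best score of any path ending in each state after the first frame.
    score = [log_pi[s] + log_b[0][s] for s in range(n_states)]
    for t in range(1, len(log_b)):
        score = [
            max(score[r] + log_a[r][s] for r in range(n_states)) + log_b[t][s]
            for s in range(n_states)
        ]
    return max(score)
```

Because only the maximum is kept at each step, the cost grows linearly with the number of frames rather than exponentially with it.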

[0073] The decision section 22 determines a class to which the feature distribution parameter Z, that is, the input
voice, belongs by applying, for example, a decision rule of the following formula to function values g_k(Z) that are
supplied from the respective discriminant function calculation sections 21_1 to 21_K (i.e., the input voice is classified as one
of the classes).




$$
C(Z) = C_k, \quad \text{if } g_k(Z) = \max_i \{ g_i(Z) \} \tag{11}
$$

where C(Z) is a function of a discrimination operation (process) for determining a class to which the feature distribution
parameter Z belongs. The operation "max" on the right side of the second equation of Formula (11) means the
maximum value of function values g_i(Z) following it (i = 1, 2, ..., K).

[0074] The decision section 22 determines a class according to Formula (11) and outputs it as a recognition result 
of the input voice. 
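The decision rule of Formula (11) amounts to picking the index of the largest function value. A short sketch, with a hypothetical function name and zero-based class indices chosen for illustration:

```python
def classify(scores):
    """Formula (11): return the index k whose score g_k(Z) is largest.

    scores -- list of function values g_1(Z)..g_K(Z), stored zero-based.
    Ties are resolved in favor of the lowest index.
    """
    return max(range(len(scores)), key=lambda k: scores[k])
```

The scores may be probabilities or log probabilities; the chosen class is the same because the logarithm is monotonic.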

[0075] Fig. 10 shows another example configuration of the feature extraction section 2 shown in Fig. 3. The compo- 
nents in Fig. 10 having the corresponding components in Fig. 6 are given the same reference symbols as the latter. 
That is, this feature extraction section 2 is configured basically in the same manner as that of Fig. 6 except that a noise
buffer 31 and a feature distribution parameter calculator 32 are provided instead of the noise characteristics calculator 
14 and the feature distribution parameter calculator 15, respectively. 

[0076] In this example, the noise buffer 31 temporarily stores, as noise power spectra, outputs of the
power spectrum analyzer 12 in a non-speech section. For example, the noise buffer 31 stores, as noise power spectra,
w(1), w(2), ..., w(100) that are respectively outputs y(-200), y(-199), ..., y(-101) of the power spectrum analyzer 12 of
100 frames that precede a speech section by 200 frames to 101 frames, respectively.

[0077] The noise power spectra w(n) of 100 frames (n = 1, 2, ..., N; in this example, N = 100) are output to the feature
distribution parameter calculator 32 when a speech section has appeared.
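One possible way to hold the frames described for the noise buffer 31 is a fixed-length queue. The sketch below is an assumption-laden illustration: the class name, the method names, and the convention that the most recently pushed non-speech frame is y(-1) are all hypothetical and are not stated in the specification.

```python
from collections import deque

class NoiseBuffer:
    """Keeps recent non-speech spectra and returns the oldest n_keep of them."""

    def __init__(self, n_keep=100, gap=100):
        self.n_keep = n_keep
        # Holds y(-(n_keep + gap)) .. y(-1); older frames drop off automatically.
        self._frames = deque(maxlen=n_keep + gap)

    def push(self, spectrum):
        """Store the power spectrum of one non-speech frame."""
        self._frames.append(spectrum)

    def noise_spectra(self):
        """Return w(1)..w(N), i.e. y(-200)..y(-101) once the buffer is full."""
        return list(self._frames)[: self.n_keep]
```

If fewer than `n_keep + gap` frames have been pushed when speech begins, the returned frames are simply the oldest ones available, which a real system would need to handle explicitly.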

[0078] When the speech section has appeared and the feature distribution parameter calculator 32 has received the
noise power spectra w(n) (n = 1, 2, ..., N) from the noise buffer 31, the feature distribution parameter calculator 32
calculates, for example, according to the following equations, an average vector ξ(t) and a variance matrix Ψ(t) that
define a distribution (assumed to be a normal distribution) of a power spectrum of a true voice (i.e., a distribution of
estimated values of the power spectrum of the true voice).

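$$
\begin{aligned}
\xi(t)(i) &= E[x(t)(i)] = \frac{1}{N} \sum_{n=1}^{N} \bigl(y(t)(i) - w(n)(i)\bigr) \\
\Psi(t)(i,j) &= \frac{1}{N} \sum_{n=1}^{N} \bigl(y(t)(i) - w(n)(i) - \xi(t)(i)\bigr) \times \bigl(y(t)(j) - w(n)(j) - \xi(t)(j)\bigr) \\
i &= 1, 2, \ldots, D; \quad j = 1, 2, \ldots, D
\end{aligned} \tag{12}
$$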



where w(n)(i) represents an ith component of an nth noise power spectrum w(n) (w(n)(j) is defined similarly). 
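Equations (12) can be read as taking one estimate of the true-voice spectrum per buffered noise spectrum and then computing the sample mean and the sample variance matrix of those estimates. A minimal sketch follows, using plain Python lists; the function and variable names are hypothetical.

```python
def feature_distribution(y_t, noise_buffer):
    """Sketch of Eq. (12): mean vector xi(t) and variance matrix Psi(t) for one frame.

    y_t          -- observed power spectrum y(t), a list of D values
    noise_buffer -- list of N noise power spectra w(1)..w(N), each of length D
    """
    n_frames = len(noise_buffer)
    dim = len(y_t)
    # Each buffered noise spectrum gives one estimate y(t) - w(n) of the true voice.
    estimates = [[y_t[i] - w[i] for i in range(dim)] for w in noise_buffer]
    xi = [sum(e[i] for e in estimates) / n_frames for i in range(dim)]
    psi = [
        [
            sum((e[i] - xi[i]) * (e[j] - xi[j]) for e in estimates) / n_frames
            for j in range(dim)
        ]
        for i in range(dim)
    ]
    return xi, psi
```

The off-diagonal entries of `psi` carry the correlation between spectral components.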
[0079] The feature distribution parameter calculator 32 determines an average vector ξ(t) and a variance matrix
Ψ(t) for each frame in the above manner, and outputs a feature distribution parameter Z = {z_1, z_2, ..., z_T} in the speech
section to the discrimination section 3 (a feature distribution parameter z_t is a combination of ξ(t) and Ψ(t)).
[0080] While in the case of Fig. 6 it is assumed that components of a noise power spectrum have no mutual
correlation, in the case of Fig. 10 a feature distribution parameter is determined without employing such an assumption and
hence a more accurate feature distribution parameter can be obtained.

[0081] Although in the above examples a power spectrum is used as a feature vector (feature quantity), a cepstrum,
for example, can also be used as a feature vector.

[0082] Now assume that x^c(t) represents a cepstrum of a true voice of a certain frame t and that its distribution
(distribution of estimated values of the cepstrum) is a normal distribution, for example. An average vector ξ^c(t) and a
variance matrix Ψ^c(t) that define a probability density function P^f(t)(x^c) that represents a distribution of a feature vector
(in this case, a cepstrum) x^c of the t-th frame can be determined according to the following equations.



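$$
\begin{aligned}
\xi^c(t)(i) &= \frac{1}{N} \sum_{n=1}^{N} x^c(t)(n)(i), \quad i = 1, 2, \ldots, D \\
\Psi^c(t)(i,j) &= \frac{1}{N} \sum_{n=1}^{N} \bigl(x^c(t)(n)(i) - \xi^c(t)(i)\bigr)\bigl(x^c(t)(n)(j) - \xi^c(t)(j)\bigr), \quad i = 1, 2, \ldots, D; \; j = 1, 2, \ldots, D
\end{aligned} \tag{13}
$$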



where ξ^c(t)(i) represents an ith component of the average vector ξ^c(t), Ψ^c(t)(i, j) is an ith-row, jth-column component of
the variance matrix Ψ^c(t), and x^c(t)(n)(i) is an ith component of a cepstrum x^c(t)(n) that is given by the following equations.

$$
\begin{aligned}
x^c(t)(n) &= C\, x^L(t)(n) \\
x^L(t)(n) &= \bigl(x^L(t)(n)(1),\; x^L(t)(n)(2),\; \ldots,\; x^L(t)(n)(D)\bigr) \\
x^L(t)(n)(i) &= \log\bigl(y(t)(i) - w(n)(i)\bigr)
\end{aligned} \tag{14}
$$



where i = 1, 2, ..., D. In the first equation of Equations (14), C is a DCT (discrete cosine transform) matrix.
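Equations (13) and (14) can be sketched as follows. The specification states only that C is a DCT matrix, so the unnormalized DCT-II form used here is an assumption, as is the small floor applied before the logarithm to keep its argument positive. All function names are hypothetical.

```python
import math

def dct_matrix(dim):
    """Unnormalized DCT-II matrix (an assumed choice for C in Eq. (14))."""
    return [[math.cos(math.pi * k * (i + 0.5) / dim) for i in range(dim)]
            for k in range(dim)]

def cepstral_estimates(y_t, noise_buffer, floor=1e-10):
    """Eq. (14): one cepstrum estimate x^c(t)(n) per buffered noise spectrum."""
    dim = len(y_t)
    c = dct_matrix(dim)
    estimates = []
    for w in noise_buffer:
        # The floor is an added safeguard (not in the source): log needs y - w > 0.
        x_log = [math.log(max(y_t[i] - w[i], floor)) for i in range(dim)]
        estimates.append([sum(c[k][i] * x_log[i] for i in range(dim))
                          for k in range(dim)])
    return estimates

def mean_and_variance(samples):
    """Eq. (13): sample mean vector and variance matrix with 1/N normalization."""
    n = len(samples)
    dim = len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(dim)]
    var = [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / n
            for j in range(dim)] for i in range(dim)]
    return mean, var
```

Calling `mean_and_variance(cepstral_estimates(y_t, noise_buffer))` would then yield the pair corresponding to ξ^c(t) and Ψ^c(t) for one frame.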

[0083] Where a cepstrum is used as a feature vector, the feature extraction section 2 of Fig. 3 may determine an
average vector ξ^c(t) and a variance matrix Ψ^c(t) for each frame in the above manner, and output a feature distribution
parameter Z^c = {z_1^c, z_2^c, ..., z_T^c} in a speech section to the discrimination section 3 (a feature distribution parameter
z_t^c is a combination {ξ^c(t), Ψ^c(t)}).
[0084] In this case, an output probability b'_k(s)(z_t^c), which is used to calculate a discriminant function g_k(Z^c) in the
discrimination section 3, can be determined, as a probability representing a distribution in the cepstrum space, by the
following equation that includes a probability density function P_k^m(s)(x^c) that is defined by an average vector μ_k^c(s)
and a variance matrix Σ_k^c(s) and a probability density function P^f(t)(x^c) that represents a distribution of a feature vector
(in this case, a cepstrum) of a t-th frame.



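$$
\begin{aligned}
b'_k(s)(z_t^c) &= \int P^f(t)(x^c)\, P_k^m(s)(x^c)\, dx^c \\
&= \frac{\exp\!\left(-\frac{1}{2}\bigl(\xi^c(t) - \mu_k^c(s)\bigr)^T \bigl(\Psi^c(t) + \Sigma_k^c(s)\bigr)^{-1} \bigl(\xi^c(t) - \mu_k^c(s)\bigr)\right)}{(2\pi)^{D/2}\, \bigl|\Psi^c(t) + \Sigma_k^c(s)\bigr|^{1/2}}
\end{aligned} \tag{15}
$$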



In Equation (15), the integration interval of the integral is the entire D-dimensional feature vector space (in this case,
cepstrum space). The term (ξ^c(t) - μ_k^c(s))^T is a transpose of a vector ξ^c(t) - μ_k^c(s).
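For full (non-diagonal) variance matrices, Equation (15) requires the inverse and the determinant of the summed matrix. One way to obtain both without an explicit inverse is a Cholesky factorization; the sketch below takes that route, which is a choice made for this illustration and not something the specification prescribes. It returns the logarithm of the output probability, and all names are hypothetical.

```python
import math

def log_output_probability(xi, psi, mu, sigma):
    """Log of Eq. (15) for full variance matrices psi (input) and sigma (model)."""
    dim = len(xi)
    s = [[psi[i][j] + sigma[i][j] for j in range(dim)] for i in range(dim)]
    d = [xi[i] - mu[i] for i in range(dim)]
    # Cholesky factorization s = L L^T; s must be symmetric positive definite.
    low = [[0.0] * dim for _ in range(dim)]
    for i in range(dim):
        for j in range(i + 1):
            acc = s[i][j] - sum(low[i][m] * low[j][m] for m in range(j))
            low[i][j] = math.sqrt(acc) if i == j else acc / low[j][j]
    # Forward substitution solves L v = d, so that d^T s^{-1} d equals v . v.
    v = []
    for i in range(dim):
        v.append((d[i] - sum(low[i][m] * v[m] for m in range(i))) / low[i][i])
    quad = sum(x * x for x in v)
    # log|s| is twice the sum of the logs of the diagonal of L.
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(dim))
    return -0.5 * (quad + log_det + dim * math.log(2.0 * math.pi))
```

Working with the logarithm keeps the value usable when D is large and the probability itself would underflow.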

[0085] Since, as described above, a feature distribution parameter is determined that reflects noise distribution
characteristics and speech recognition is performed by using the thus-determined feature distribution parameter, the
recognition rate can be increased.

[0086] Table 1 shows recognition rates in a case where a speech recognition (word recognition) experiment utilizing
the feature distribution parameter was conducted by using a cepstrum and an HMM method as a feature quantity of
speech and a speech recognition algorithm of the discrimination section 3, respectively, and recognition rates in a case
where a speech recognition experiment utilizing the spectral subtraction (SS) method was conducted.
Table 1

Speech input environment        Recognition rate (%)
                                SS method    Invention
Idling and background music     72           86
Running in city area            65           90
Running on expressway           57           69


[0087] In the above experiments, the number of recognition object words was 5,000 and the speaker was an unspecified
person. Speaking was performed in three kinds of environments, that is, an environment in which a car was idling
and background music was playing, an environment in which the car was running in a city area, and an environment in
which the car was running on an expressway.

[0088] As seen from Table 1, in any of those environments, a higher recognition rate was obtained by the speech
recognition utilizing the feature distribution parameter.

[0089] The speech recognition apparatus to which the invention is applied has been described above. This type of 
speech recognition apparatus can be applied to a car navigation apparatus capable of speech input and other various 
apparatuses. 

[0090] In the above embodiment, a feature distribution parameter is determined which reflects distribution charac- 
teristics of noise. It is noted that, for example, the noise includes external noise in a speaking environment as well as 
characteristics of a communication line (when a voice that is transmitted via a telephone line or some other commu- 
nication line is to be recognized). 

[0091] For example, the invention can also be applied to learning for a particular speaker in a case of specific speaker
recognition. In this case, the invention can increase the learning speed. 

[0092] The invention can be applied to not only speech recognition but also pattern recognition such as image
recognition. For example, in the case of image recognition, the image recognition rate can be increased by using a feature
distribution parameter that reflects distribution characteristics of noise such as lens characteristics of a camera for
photographing images, weather states, and the like.

[0093] In the above embodiment, a feature distribution parameter that represents a distribution in the power spectrum
space or the cepstrum space is determined. However, other spaces such as a space of linear prediction coefficients,
a space of a difference between cepstrums of adjacent frames, and a zero-cross space can also be used as a space
in which to determine a distribution.

[0094] In the above embodiment, a feature distribution parameter representing a distribution in a space of one (kind
of) feature quantity of speech is determined. However, it is possible to determine feature distribution parameters in
respective spaces of a plurality of feature quantities of speech. It is also possible to determine a feature distribution
parameter in one or more of the spaces of a plurality of feature quantities of speech and perform speech recognition by
using the feature distribution parameter thus determined and feature vectors in the spaces of the remaining feature
quantities.

[0095] In the above embodiment, a distribution of a feature vector (estimated values of a feature vector of a true
voice) in the feature vector space is assumed to be a normal distribution, and a feature distribution parameter
representing such a distribution is used. However, other distributions such as a logarithmic normal probability distribution,
a discrete probability distribution, and a fuzzy distribution can also be used as a distribution to be represented by a
feature distribution parameter.

[0096] Further, in the above embodiment, class discrimination in the discrimination section 3 is performed by using 
an HMM in which the output probability is represented by a normal distribution. However, it is possible to perform class 
discrimination in the discrimination section 3 in other ways, for example, by using an HMM in which the output probability 
is represented by a mixed normal probability distribution or a discrete distribution, or by using a normal probability 
distribution function, a logarithmic probability distribution function, a polynomial function, a neural network, or the like. 
[0097] As described above, in the feature extraction apparatus and method according to the invention, a feature 
distribution parameter representing a distribution that is obtained when mapping of input data is made to a space of a 
feature quantity of the input data is calculated. Therefore, for example, when input data includes noise, a parameter 
that reflects distribution characteristics of the noise can be obtained. 

[0098] In the pattern recognition apparatus and method according to the invention, a feature distribution parameter 
representing a distribution that is obtained when mapping of input data is made to a space of a feature quantity of the 
input data is calculated, and the feature distribution parameter is classified as one of a predetermined number of 
classes. Therefore, for example, when input data includes noise, a parameter that reflects distribution characteristics 
of the noise can be obtained. This makes it possible to increase the recognition rate of the input data.



Claims 


1. A feature extraction apparatus operative to extract a feature quantity of input data, comprising: 

calculating means for calculating a feature distribution parameter representing a distribution that is obtained 
when mapping of the input data is made to a space of a feature quantity of the input data. 


2. The feature extraction apparatus according to claim 1, wherein the calculating means calculates a feature distri- 
bution parameter that represents a normal probability distribution. 

3. The feature extraction apparatus according to claim 1, wherein the calculating means calculates a feature
distribution parameter that represents a logarithmic normal probability distribution.

4. The feature extraction apparatus according to claim 1, wherein the calculating means calculates a feature
distribution parameter that represents a discrete probability distribution.

5. The feature extraction apparatus according to claim 1, wherein the calculating means calculates a feature
distribution parameter that represents a fuzzy distribution.

6. The feature extraction apparatus according to claim 1, wherein the calculating means calculates the feature
distribution parameter in a space of at least one of plural kinds of feature quantities of the input data.

7. A feature extraction method for extracting a feature quantity of input data, comprising the step of:

calculating a feature distribution parameter representing a distribution that is obtained when mapping of the 
input data is made to a space of a feature quantity of the input data. 


8. A pattern recognition apparatus operative to recognize a pattern of input data by classifying it as one of a prede- 
termined number of classes, comprising: 

calculating means for calculating a feature distribution parameter representing a distribution that is obtained
when mapping of the input data is made to a space of a feature quantity of the input data; and

classifying means for classifying the feature distribution parameter as one of the predetermined number of 
classes. 

9. The pattern recognition apparatus according to claim 8, wherein the calculating means calculates a feature
distribution parameter that represents a normal probability distribution.

10. The pattern recognition apparatus according to claim 8, wherein the calculating means calculates a feature distri- 
bution parameter that represents a logarithmic normal probability distribution. 

11. The pattern recognition apparatus according to claim 8, wherein the calculating means calculates a feature
distribution parameter that represents a discrete probability distribution.

12. The pattern recognition apparatus according to claim 8, characterized in that the calculating means calculates a 
feature distribution parameter that represents a fuzzy distribution. 


13. The pattern recognition apparatus according to claim 8, wherein the calculating means calculates the feature
distribution parameter in a space of at least one of plural kinds of feature quantities of the input data, and wherein
the classifying means classifies the remaining kinds of feature quantities and the feature distribution parameter
as one of the predetermined number of classes.






14. The pattern recognition apparatus according to any one of claims 8 to 13, wherein the classifying means judges, 
by using at least one normal probability distribution function, which of the predetermined number of classes the 
feature distribution parameter belongs to. 
15. The pattern recognition apparatus according to any one of claims 8 to 13, wherein the classifying means judges,
by using at least one polynomial function, which of the predetermined number of classes the feature distribution
parameter belongs to.

16. The pattern recognition apparatus according to any one of claims 8 to 13, wherein the classifying means judges,
by using at least one hidden Markov model method, which of the predetermined number of classes the feature
distribution parameter belongs to.

17. The pattern recognition apparatus according to any one of claims 8 to 13, wherein the classifying means judges,
by using at least one neural network, which of the predetermined number of classes the feature distribution
parameter belongs to.

18. The pattern recognition apparatus according to any one of claims 8 to 17, wherein the input data is speech data. 

19. The pattern recognition apparatus according to claim 18, wherein the calculating means calculates the feature
distribution parameter by using the speech data and information relating to noise.

20. The pattern recognition apparatus according to claim 18 or 19, wherein the calculating means calculates a feature
distribution parameter that represents a distribution in a power spectrum space, a cepstrum space or a speech
magnitude space of the speech data.

21. A pattern recognition method for recognizing a pattern of input data by classifying it as one of a predetermined
number of classes, comprising the steps of:

calculating a feature distribution parameter representing a distribution that is obtained when mapping of the
input data is made to a space of a feature quantity of the input data; and
classifying the feature distribution parameter as one of the predetermined number of classes.

22. A pattern recognition apparatus operative to recognize a pattern of input data by classifying it as one of a
predetermined number of classes, comprising:

framing means for extracting parts of the input data at predetermined intervals, and outputting each extracted
data as 1-frame data;

feature extracting means receiving the 1-frame data of each extracted data, for outputting a feature distribution
parameter representing a distribution that is obtained when mapping of the 1-frame data is made to a space
of a feature quantity of the 1-frame data; and

classifying means for classifying a series of feature distribution parameters as one of the predetermined
number of classes.

23. The pattern recognition apparatus according to claim 22, wherein the input data is speech data.

24. The pattern recognition apparatus according to claim 22 or 23, wherein the feature extracting means comprises: 

spectrum analyzing means for making an analysis of a spectrum of data including the 1-frame data and
outputting the spectrum;

noise characteristic calculating means for calculating and outputting a noise characteristic; and
feature distribution parameter calculating means for calculating a feature distribution parameter representing
a distribution of the spectrum of the 1-frame data based on the spectrum and the noise characteristic, and
outputting the calculated feature distribution parameter.


25. The pattern recognition apparatus according to claim 24, wherein the feature distribution parameter is a parameter 
representing a distribution in a cepstrum space, a power spectrum space, or a spectrum magnitude space. 

26. The pattern recognition apparatus according to claim 24 or 25, wherein the feature extracting means further
comprises:

data input section detecting means for detecting a data input section in which the input data is input and a
data non-input section in which the input data is not input, and outputting a data section detection result; and
selecting means for selectively outputting the spectrum that is output from the spectrum analyzing means to
the noise characteristic calculating means or the feature distribution parameter calculating means based on
the data section detection result.

27. The pattern recognition apparatus according to claim 26, wherein the noise characteristic calculating means
outputs data based on noise in the data non-input section.

28. A pattern recognition method for recognizing a pattern of input data by classifying it as one of a predetermined 
number of classes, comprising: 



a framing step of extracting parts of the input data at predetermined intervals, and outputting each extracted
data as 1-frame data;

a feature extracting step of receiving the 1-frame data of each extracted data, and outputting a feature
distribution parameter representing a distribution that is obtained when mapping of the 1-frame data is made to a
space of a feature quantity of the 1-frame data; and

a classifying step of classifying a series of feature distribution parameters as one of the predetermined number
of classes.



29. The pattern recognition method according to claim 28, wherein the input data is speech data.


30. The pattern recognition method according to claim 28 or 29, wherein the feature extracting step comprises:

a spectrum analyzing step of making an analysis of data including the 1-frame data and outputting the
spectrum;

a noise characteristic calculating step of calculating and outputting a noise characteristic; and

a feature distribution parameter calculating step of calculating a feature distribution parameter representing a
distribution of the spectrum of the 1-frame data based on the spectrum and the noise characteristic, and
outputting the calculated feature distribution parameter.

31. The pattern recognition method according to claim 30, wherein the feature distribution parameter is a parameter
representing a distribution in a cepstrum space, a power spectrum space or a spectrum magnitude space.

32. The pattern recognition method according to claim 30 or 31, wherein the feature extracting step further comprises:

a data input section detecting step of detecting a data input section in which the input data is input and a data
non-input section in which the input data is not input, and outputting a data section detection result; and
a selecting step of selectively outputting, based on the data section detection result, the spectrum that is output
by the spectrum analyzing step.

33. The pattern recognition method according to claim 32, wherein the noise characteristic calculating step outputs
data based on noise in the data non-input section.
FIG. 5
FIG. 9